{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "import pandas as pd\n",
    "# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package\n",
    "import joblib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "Before using the notebook, you **must** set the variables in the cell below. Set each variable to a string containing the corresponding filename. For example:\n",
    "\n",
    "**BEFORE**\n",
    "```python\n",
    "vectorizer_file = # Replace this comment with Vectorizer Filename\n",
    "labeled_data_file = # Replace this comment with Labeled Data Csv Filename\n",
    "tfidf_matrix_file = # Replace this comment with Tfidf Matrix Pkl Filename\n",
    "model_training_file = # Replace this comment with Model Training Pkl Filename\n",
    "label_file = # Replace this comment with Label Csv Filename\n",
    "```\n",
    "\n",
    "**AFTER**\n",
    "```python\n",
    "vectorizer_file = \"project_2_vectorizer.pkl\"\n",
    "labeled_data_file = \"project_2_labeled_data.csv\"\n",
    "tfidf_matrix_file = \"project_2_tfidf_matrix.pkl\"\n",
    "model_training_file = \"project_2_training_2.pkl\"\n",
    "label_file = \"project_2_labels.csv\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vectorizer_file = # Replace this comment with Vectorizer Filename\n",
    "labeled_data_file = # Replace this comment with Labeled Data Csv Filename\n",
    "tfidf_matrix_file = # Replace this comment with Tfidf Matrix Pkl Filename\n",
    "model_training_file = # Replace this comment with Model Training Pkl Filename\n",
    "label_file = # Replace this comment with Label Csv Filename"
   ]
  },
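  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the variables above are set, it can help to confirm that each file exists before running the rest of the notebook. The snippet below is a minimal sketch and not part of the original workflow: `find_missing_files` is a hypothetical helper, and the filenames are the example values from the **AFTER** block above.\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def find_missing_files(filenames):\n",
    "    \"\"\"Return the names in `filenames` that are not existing files.\"\"\"\n",
    "    return [name for name in filenames if not Path(name).is_file()]\n",
    "\n",
    "\n",
    "# Example values from the AFTER block (assumptions); substitute your own filenames.\n",
    "example_files = [\n",
    "    \"project_2_vectorizer.pkl\",\n",
    "    \"project_2_labeled_data.csv\",\n",
    "    \"project_2_tfidf_matrix.pkl\",\n",
    "    \"project_2_training_2.pkl\",\n",
    "    \"project_2_labels.csv\",\n",
    "]\n",
    "\n",
    "missing = find_missing_files(example_files)\n",
    "print(\"Missing files:\", missing if missing else \"none\")\n",
    "```\n",
    "\n",
    "In the notebook itself, the list would hold the five variables defined in the cell above instead of literal strings."
   ]
  },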
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.A. Preprocessing new data for the model to predict\n",
    "\n",
    "The following sample code preprocesses new data into a format that the saved model can use.\n",
    "\n",
    "**Note:** to see which words the columns of the transformed TF-IDF matrix correspond to, call `print(tfidf_transformer.vocabulary_)`. This prints a dictionary mapping each word that was used to its column index in the transformed matrix."
   ]
  },
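  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the note above concrete, here is a small sketch of what `vocabulary_` holds. It assumes the saved vectorizer behaves like scikit-learn's `TfidfVectorizer`; the corpus and every `toy_` name are made up for illustration and are unrelated to the project files.\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "# Toy corpus standing in for the real training text (an assumption).\n",
    "toy_corpus = [\"red apple\", \"green apple\", \"green pear\"]\n",
    "\n",
    "toy_vectorizer = TfidfVectorizer()\n",
    "toy_matrix = toy_vectorizer.fit_transform(toy_corpus)\n",
    "\n",
    "# vocabulary_ maps each word to its column index in the TF-IDF matrix.\n",
    "for word, column in sorted(toy_vectorizer.vocabulary_.items(), key=lambda item: item[1]):\n",
    "    print(column, word)\n",
    "\n",
    "# One row per document, one column per vocabulary word.\n",
    "print(toy_matrix.shape)\n",
    "```\n",
    "\n",
    "Looking up a column index in this mapping shows which word a given feature of the transformed matrix represents."
   ]
  },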
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(vectorizer_file, \"rb\") as tfidf_vectorizer:\n",
    "    # load the vectorizer\n",
    "    tfidf_transformer = pickle.load(tfidf_vectorizer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_data = [\n",
    "    \"this is new data\",\n",
    "    \"that needs to be formatted\",\n",
    "    \"In the same way that the other data was before\",\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# apply the vectorizer to new data\n",
    "transformed_data = tfidf_transformer.transform(new_data)"
   ]
  },
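  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell above stops at `transformed_data`. As a rough, self-contained sketch of how a pickled vectorizer and model fit together at prediction time, the example below trains a toy `LogisticRegression` on a toy corpus. Every name prefixed with `toy_` or `restored_` is an assumption for illustration; the actual saved model may be a different estimator and may expect a different input layout.\n",
    "\n",
    "```python\n",
    "import pickle\n",
    "\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# Toy training text and numeric label IDs (assumptions for illustration only).\n",
    "toy_texts = [\"great service\", \"great food\", \"terrible service\", \"terrible food\"]\n",
    "toy_label_ids = [1, 1, 0, 0]\n",
    "\n",
    "toy_vectorizer = TfidfVectorizer()\n",
    "toy_model = LogisticRegression()\n",
    "toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_label_ids)\n",
    "\n",
    "# Round-trip both objects through pickle as a stand-in for loading them from disk.\n",
    "restored_vectorizer = pickle.loads(pickle.dumps(toy_vectorizer))\n",
    "restored_model = pickle.loads(pickle.dumps(toy_model))\n",
    "\n",
    "# transform (not fit_transform) reuses the vocabulary learned during training,\n",
    "# so the new rows have the same columns the model was trained on.\n",
    "toy_new_data = [\"great\", \"terrible\"]\n",
    "toy_predictions = restored_model.predict(restored_vectorizer.transform(toy_new_data))\n",
    "print(toy_predictions)\n",
    "```\n",
    "\n",
    "The important detail is calling `transform` rather than refitting: refitting on new text would produce a different vocabulary and columns that no longer line up with the model."
   ]
  },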
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.B. Applying the model to predict new data\n",
    "\n",
    "The following sample code reads in the zipped files and gets predictions for the unlabeled data.\n",
    "\n",
    "**Note:** the model predicts each label as a numeric ID. The `project_#_labels.csv` file gives the mapping between label text and ID."
   ]
  },
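  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of the ID-to-text mapping described above, the sketch below builds a lookup from an in-memory table. The column names `Label_ID` and `Name` are the ones this notebook reads from the label file; the label values and every `toy_` name are invented for the example.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical label table; a real one would come from pd.read_csv(label_file).\n",
    "toy_labels = pd.DataFrame({\"Label_ID\": [0, 1], \"Name\": [\"negative\", \"positive\"]})\n",
    "\n",
    "# Build a {label ID: label text} lookup.\n",
    "toy_label_dict = toy_labels.set_index(\"Label_ID\")[\"Name\"].to_dict()\n",
    "\n",
    "# Numeric predictions as a model might return them (made up here).\n",
    "toy_predictions = [1, 0, 1]\n",
    "toy_predicted_labels = [toy_label_dict[pred] for pred in toy_predictions]\n",
    "print(toy_predicted_labels)\n",
    "```\n",
    "\n",
    "A plain dictionary keeps the translation step explicit and raises a `KeyError` if the model ever returns an ID that is missing from the label file."
   ]
  },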
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read in the TFIDF matrix and the labeled data\n",
    "labeled_frame = pd.read_csv(labeled_data_file)\n",
    "with open(tfidf_matrix_file, \"rb\") as tfidf_file:\n",
    "    tfidf_dict = pickle.load(tfidf_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Subset the TFIDF matrix to the unlabeled rows and collect them as a 2D list\n",
    "labeled_ids = set(labeled_frame[\"ID\"].tolist())  # a set makes each membership check O(1)\n",
    "unlabeled = [row for key, row in tfidf_dict.items() if key not in labeled_ids]\n",
    "\n",
    "# read in the model from the pickle file\n",
    "model = joblib.load(model_training_file)\n",
    "\n",
    "# apply the model to the unlabeled data\n",
    "predictions = model.predict(unlabeled)\n",
    "\n",
    "# print the results\n",
    "print(predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# open the label dictionary to translate label IDs to their corresponding text\n",
    "labels_frame = pd.read_csv(label_file)\n",
    "label_dict = labels_frame.set_index(\"Label_ID\")[\"Name\"].to_dict()\n",
    "\n",
    "# get the predictions as actual labels (stored under a new name so this cell can be re-run)\n",
    "predicted_labels = [label_dict[pred] for pred in predictions]\n",
    "print(predicted_labels)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
